Abstract

Many academic papers imply that parallel computing is only
worthwhile when applications achieve nearly linear speedup
(i.e., execute nearly p times faster on p processors). This
note shows that parallel computing is cost-effective whenever
speedup exceeds costup, the parallel system cost divided by
uniprocessor cost. Furthermore, when applications have large
memory requirements (e.g., 512 megabytes), the costup, and
hence the speedup necessary to be cost-effective, can be much
less than linear.

Categories and Subject Descriptors: C.1.2 [Processor
Architectures]: Multiple Data Stream Architectures
(multiprocessors); C.4 [Computer Systems Organization]:
Performance of Systems; K.6 [Management of Computing and
Information Systems]: Installation Management: pricing and
resource allocation.


*To the referees: We intend this to be a short, pointed note
in the tradition of Gustafson’s “Reevaluating Amdahl’s Law”
(CACM May ’88, pp. 532-533) or J. Smith’s “Characterizing
Computer Performance with a Single Number” (CACM Oct ’88,
pp. 1202-1206).

†This paper generalizes the cost model introduced by Falsafi
and Wood (Cost/Performance of a Parallel Computer Simulator,
in Proceedings of PADS ’94, July 1994). This work is supported
in part by Wright Laboratory Avionics Directorate, Air Force
Material Command, USAF, under grant #F33615-94-1-1525 and
ARPA order no. B550, NSF PYI Awards MIP-8957278 and
CCR-9157366, NSF Grant MIP-9225097, and donations from A.T.&T.
Bell Laboratories, Digital Equipment Corporation, Sun
Microsystems, Thinking Machines Corporation, and Xerox
Corporation. The U.S. Government is authorized to reproduce
and distribute reprints for Governmental purposes
notwithstanding any copyright notation thereon. The views and
conclusions contained herein are those of the authors and
should not be interpreted as necessarily representing the
official policies or endorsements, either expressed or
implied, of the Wright Laboratory Avionics Directorate or the
U.S. Government.


Introduction 

Suppose that you need to run many simulations that require
large amounts of memory. You may run the simulations on a
uniprocessor or a p-processor parallel system. You know that
your simulations cannot be parallelized perfectly, so speedups
will be less than linear. Parallel simulation will reduce
response time, but your task is to maximize job throughput per
unit cost, or equivalently, to minimize cost-performance (cost
divided by performance).

Which system do you select? Conventional wisdom says use the
uniprocessor, since speedups are less than linear^1. We show,
however, that the parallel system provides better (i.e.,
lower) cost-performance whenever speedup exceeds costup, the
parallel system cost divided by uniprocessor cost. Our result
is not tied to simulation, but holds for all applications.

Furthermore, we find that when applications have large memory
requirements (e.g., 512 megabytes), the costup, and hence the
speedup necessary to be cost-effective, can be much less than
linear. This is because the parallel system does not need p
times the memory of the uniprocessor, since parallelizing a
job rarely multiplies its memory requirements by p.

Three decades ago Amdahl argued that each
million-instructions-per-second (MIPS) of processing power
should be accompanied by 1 megabyte of memory [4] (p. 17). Our
results can be interpreted as the converse of Amdahl’s dictum:
Each 1 megabyte of memory should be accompanied by 1 MIPS of
processing power. If one processor does not provide enough
power, multiple processors should be used to make effective
use of the memory’s capacity and bandwidth.


^1 Alternatively, you could use p uniprocessors to increase
throughput and yet retain the same cost-performance as one
uniprocessor: p times the cost divided by p times the
performance.
Speedup & Costup


To formalize our results, let the time to execute a job with
p processors be time(p). Parallel system performance is often
characterized using speedup:

speedup(p) = time(1) / time(p).


Let the cost with p processors be cost(p). The cost could be
just the hardware cost (for processors, memory, I/O devices,
backplanes, power supplies, etc.) or include software costs
(the costs of building the parallel application and system
software, amortized over their expected lifetime). Analogous
to speedup, we introduce costup to characterize parallel
system cost:

costup(p) = cost(p) / cost(1).


To determine the cost-effectiveness of a system, performance
and cost are often combined to get cost-performance. Taking
performance as the reciprocal of execution time:

cost-performance(p) = cost(p) / (1 / time(p)) = cost(p) · time(p).

Parallel computing is more cost-effective whenever its
cost-performance is better (smaller) than a uniprocessor’s:

cost-performance(p) < cost-performance(1).
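The comparison can be checked numerically. The following is a minimal sketch, assuming performance is 1/time so that cost-performance is cost times execution time; the function names and the dollar and time figures are hypothetical and chosen only for illustration.

```python
def cost_performance(cost, time):
    """Cost divided by performance, with performance taken as 1 / time."""
    return cost * time


def speedup(time_1, time_p):
    """Uniprocessor time divided by p-processor time."""
    return time_1 / time_p


def costup(cost_1, cost_p):
    """Parallel system cost divided by uniprocessor cost."""
    return cost_p / cost_1


# Hypothetical systems: the parallel one costs 3x as much and runs 4x faster.
cost_1, time_1 = 100_000.0, 80.0
cost_p, time_p = 300_000.0, 20.0

better_cost_performance = (
    cost_performance(cost_p, time_p) < cost_performance(cost_1, time_1)
)
speedup_exceeds_costup = speedup(time_1, time_p) > costup(cost_1, cost_p)

# Both tests give the same verdict for these inputs.
print(better_cost_performance, speedup_exceeds_costup)
```

With these inputs the speedup (4) is well below linear for any system that could plausibly cost three times as much, yet both comparisons favor the parallel system.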


Substituting cost-performance(p) = cost(p) · time(p) into this
inequality and dividing both sides by cost(1) · time(p)
reveals our principal result that parallel computing is more
cost-effective whenever:

speedup(p) > costup(p).

Our result is true in general and does not depend on the
assumptions we make below to calculate specific values. What
constitutes cost depends upon one’s point of view. A computer
vendor may see costs as the sum of research and development,
components, manufacturing, and advertising, while a customer
may view cost as purchase price.

While this theoretical result is interesting, it has
practical impact when costups are less than linear. We show
below that this happens when memory dominates system cost.

Remember Memory

Memory is an important component in the hardware costs of
today’s machines. Assume that our job requires m megabytes on
a uniprocessor and m' megabytes with p processors. (If virtual
memory is used, m and m' are determined by the job’s working
set requirements rather than by the maximum memory
referenced.) Usually m' is larger than m to permit the
replication of some application or operating system code or
data structures, or because parallel working sets are larger.
When m is small or p is large, m' may be much larger than m,
because m/p puts too few memory chips with each processor to
adequately satisfy the processor’s bandwidth requirements.
Consider a processor that needs an 8-byte datapath to each of
two interleaved banks. If the memory is implemented with
4-megaword-by-4-bit dynamic RAMs, the minimum memory size per
processor is 64 megabytes (4 megaword · 8 bytes per bank · 2
banks).

Usually, however, significant memory cost will tend to make
costups less than linear. This is because a parallel system
does not need p times the memory of the uniprocessor, since
parallelizing a job rarely multiplies its memory requirements
by p (i.e., m' ≪ p · m). We can emphasize this in new speedup
and costup equations:

speedup(p, m, m') = time(1, m) / time(p, m'),
costup(p, m, m') = cost(p, m') / cost(1, m).

Parallel computing is more cost-effective when:

speedup(p, m, m') > costup(p, m, m').

But how does memory affect real costups?

A  Multi  Example 

As a concrete example, we use current Silicon Graphics (SGI)
prices to show that actual costups can be much less than
linear for systems with hundreds of megabytes of main memory.
We consider hardware costs, but not software ones, since we do
not know how to non-controversially measure the latter. All
prices are list prices in U.S. dollars as of July 15, 1994
[5]. We ignore the volume discounts that may favor
uniprocessors. Since we take the ratio of two list prices, our
quantitative results also hold exactly when a vendor gives a
customer the same discount on both systems.

Silicon Graphics products range from low-cost desktop
workstations to million-dollar shared-memory multiprocessors.
We focus on their server products, so our comparison will not
be biased by expensive graphics engines and monitors. At the
low end, the Silicon Graphics CHALLENGE S is a highly
competitively priced monitor-less uniprocessor workstation,
with a list price of $16,600. However, because it is packaged
as a small desktop unit, the CHALLENGE S has a maximum memory
size of 256 megabytes. While 256 megabytes is sufficient for
many computations, it is far too small for many of the large
and long-running applications we might want to parallelize.

Figure 1: SGI costups with no memory overhead. Parallel
computing is more cost-effective when speedups exceed the
costup(p,m,m') for p processors (different lines) and memory
size m megabytes (x-axis). This graph assumes no memory
overhead (m' = m). The “uniprocessor” line represents a
uniprocessor with degenerate costup of 1.

Figure 2: SGI costups with 100% memory overhead. Parallel
computing, even if it uses 100% more memory (m' = 2m), is more
cost-effective when speedups exceed the costup(p,m,m').

To achieve larger memory capacity requires purchasing a
deskside configuration, such as the CHALLENGE DM. These
deskside units can support up to 6 gigabytes of physical
memory, but at a significant premium: a uniprocessor CHALLENGE
DM lists for about $38,400 plus about $100 per megabyte. This
results in a uniprocessor cost of:

cost(1, m) = $38400 + $100 · m.

For comparison, we use the Silicon Graphics CHALLENGE XL as
the parallel system^2. The CHALLENGE XL is a rack-mounted
bus-based multiprocessor that supports 2 to 40 processors with
a cost that closely follows:

cost(p, m') = $81600 + $20000 · p + $100 · m'.

Substitution reveals:

costup(p, m, m') = (2.125 + 0.521 · p + 0.0026 · m') / (1 + 0.0026 · m).

^2 This comparison is somewhat biased towards the
uniprocessor, since the CHALLENGE DM uses a 100MHz R4400
processor rather than the 150MHz R4400 processor of the
CHALLENGE XL. Silicon Graphics does not currently sell a
uniprocessor deskside unit with the faster processor.

Figure 1 illustrates costups with SGI prices and the
optimistic assumption that parallel computing requires no
additional memory (m' = m). Different lines represent the
number of processors p, while the x-axis gives the memory size
m in megabytes. The data support our principal result:


With  real  price  data,  parallel  computing 
can  be  more  cost-effective  at  speedups  much 
less  than  p  for  large  but  practical  memory 
sizes. 


For systems requiring 512 megabytes, for example, 8-, 16-,
and 32-processor systems are more cost-effective than a
uniprocessor when speedups exceed 3.3, 5.0, and 8.6,
respectively. These speedups correspond to efficiencies,
speedup(p,m)/p, of only 0.41, 0.32, and 0.27. While 512
megabytes may sound like a lot of memory for a uniprocessor,
it is only 64, 32, and 16 megabytes per processor for 8-, 16-,
and 32-processor systems.
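The break-even speedups can be recomputed directly from the two list-price formulas. This is a minimal sketch, assuming those formulas hold exactly and that m' = m; the function names are illustrative and not from the paper. Under these assumptions the 16-processor value comes out near 5.05, close to the 5.0 quoted above.

```python
def cost_uniprocessor(m):
    """CHALLENGE DM list price in dollars for m megabytes."""
    return 38_400 + 100 * m


def cost_parallel(p, m_par):
    """CHALLENGE XL list price in dollars for p processors and m' megabytes."""
    return 81_600 + 20_000 * p + 100 * m_par


def costup(p, m, m_par):
    """Parallel system cost divided by uniprocessor cost."""
    return cost_parallel(p, m_par) / cost_uniprocessor(m)


m = 512  # megabytes, with no memory overhead (m' = m)
for p in (8, 16, 32):
    c = costup(p, m, m)
    # The parallel system wins once speedup exceeds c, i.e. efficiency exceeds c / p.
    print(f"p={p:2d}  break-even speedup={c:.2f}  efficiency={c / p:.2f}  "
          f"MB per processor={m // p}")
```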

But what happens when parallel computing requires more memory
than using a uniprocessor? Figure 2 illustrates costups with
100% memory overhead; that is, m' = 2 · m. Our principal
result is qualitatively unchanged: parallel computing can
still be cost-effective at speedups much less than linear.
When memory is small, doubling parallel memory cost has little
effect. When it is large, costups approach 2.0 instead of 1.0,
but are still much less than linear.
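The effect of memory overhead can be explored with the same list-price model. The sketch below is illustrative only: the `overhead` parameter is a name introduced here (1.0 means m' = 2m), and the final call uses a memory size far beyond the 6-gigabyte limit purely to show the limiting value.

```python
def costup(p, m, overhead=1.0):
    """Costup when the parallel run needs m' = (1 + overhead) * m megabytes."""
    m_par = (1.0 + overhead) * m
    cost_parallel = 81_600 + 20_000 * p + 100 * m_par
    cost_uniprocessor = 38_400 + 100 * m
    return cost_parallel / cost_uniprocessor


p = 32
for m in (64, 512, 4096):
    no_overhead = costup(p, m, overhead=0.0)
    doubled = costup(p, m)
    print(f"m={m:5d} MB  costup(m'=m)={no_overhead:.2f}  "
          f"costup(m'=2m)={doubled:.2f}")

# Hypothetical memory size, used only to show the costup tending to 2.0.
print(f"very large m: costup={costup(p, 10**7):.3f}")
```

For small m the two columns are nearly equal, and for large m the doubled-memory costup falls toward 2.0 rather than 1.0, both far below p = 32.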
More Generally

While SGI costups are interesting, we can generalize the
result using a simple hardware cost model:

cost(1, m) = f(1) + g(m),
cost(p, m') = f(p) + g(m'),

where g is memory cost and f is the cost of everything else
(e.g., processor(s), disks, backplane, power supply),
normalized so that f(1) = 1. This model assumes that memory
costs the same in a uniprocessor or a parallel system of any
size. While this assumption seems reasonable given current
technologies, marketing considerations can make parallel
system memory more expensive [3]. Using this model, costup is:

costup(p, m, m') = (f(p) + g(m')) / (1 + g(m)).

If memory costs are negligible, then costup is f(p). The
value of f(p) can be less than p if there is a significant
fixed cost for both a uniprocessor and a parallel system. On
the other hand, f(p) can be more than p if the fixed cost for
a uniprocessor is much less than the fixed cost for a parallel
system or the interconnection network constitutes a large part
of the parallel system cost. When f(p) > p, a parallel system
cannot be cost-effective (even with linear speedups) until
memory costs become significant.

Our principal result, however, is manifest when the memory
costs are significant. When memory cost dominates, the costup
approaches g(m')/g(m). If g(m) is proportional to m, then
g(m')/g(m) = m'/m and is likely to be much less than p. More
importantly, costups can be small even when memory costs are
significant but not dominant (e.g., if memory is half the
uniprocessor’s cost, g(m) = 1.0).
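The three regimes of the general model can be illustrated with a few numbers. This is a sketch under stated assumptions: f(p) = p is a hypothetical choice (non-memory cost exactly linear in processors), m' = m, and the function name is introduced here for illustration.

```python
def general_costup(f_p, g_m, g_m_par):
    """costup = (f(p) + g(m')) / (f(1) + g(m)), with f(1) normalized to 1."""
    return (f_p + g_m_par) / (1.0 + g_m)


p = 8
f_p = float(p)  # hypothetical: non-memory cost grows linearly with p

# Negligible memory cost: costup equals f(p).
negligible = general_costup(f_p, 0.0, 0.0)

# Memory is half the uniprocessor's cost (g(m) = 1.0) and m' = m.
half = general_costup(f_p, 1.0, 1.0)

# Memory dominates (g(m) = 100, chosen arbitrarily): costup nears g(m')/g(m) = 1.
dominant = general_costup(f_p, 100.0, 100.0)

print(negligible, half, round(dominant, 3))
```

Even in the middle case, where memory is only half the uniprocessor's cost, the break-even speedup for 8 processors drops from 8 to 4.5.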

This result may come as a surprise to those who define
parallel system efficiency with speedup(p)/p. With this
definition, “efficiency” is maximized at 1.0 when p = 1. Why
then do we find parallel systems, with even modest speedups,
to be more “efficient”? The explanation is that speedup(p)/p
is processor-centric: it measures the utilization of
processors but ignores memory. Our results show that when
memory is sufficiently large (and expensive), more than one
processor should be used to make effective use of the memory
capacity and bandwidth. This result may also call into
question the wisdom of time-sharing large-memory jobs without
considering memory-processor interaction metrics like the
space-time product [1].


Related  Work 

Few papers address the cost-effectiveness of parallel
computing. Fuller [3] compared the multiprocessor CMU C.mmp
(based on the DEC PDP-11/20 and 11/40 processors) with the
uniprocessor DEC PDP-10. He found C.mmp to be three to four
times more cost-effective; however, his results were dependent
upon the specific processor and (differing) memory costs of
these systems.

Falsafi and Wood [2] investigated the cost-effectiveness of
the Wisconsin Wind Tunnel (WWT) parallel simulator. WWT runs
on a Thinking Machines CM-5 (the host), but models the
processors and memories of alternative cache-coherent
shared-memory machines (the targets) with enough detail to run
target executables. Falsafi and Wood found that WWT is more
cost-effective than uniprocessor simulations for studying
large target systems (e.g., 32 or more nodes), because those
runs demand vast host memory. Our work generalizes their
result.

Conclusions 

This paper compared the cost-performance of a uniprocessor
and a parallel system for maximizing throughput. We found that
parallel computing is cost-effective whenever speedup exceeds
costup, the parallel system cost divided by uniprocessor cost.
Furthermore, when applications have large memory requirements
(e.g., 512 megabytes), the costup, and hence the speedup
necessary to be cost-effective, can be much less than linear.
Intuitively, when memory is sufficiently large (and
expensive), more than one processor may be needed to
efficiently utilize the memory.

Amdahl argued that each MIPS of processing power should be
accompanied by 1 megabyte of memory. We find the converse:
Each 1 megabyte of memory should be accompanied by 1 MIPS of
processing power. If one processor does not provide enough
power, multiple processors should be used to balance the
memory’s capacity and bandwidth.